48 research outputs found

    Algebraic Comparison of Partial Lists in Bioinformatics

    The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling a biological phenotype in terms of a classification or regression model. Because of resampling protocols, or simply within a meta-analysis comparison, sets of alternative feature lists (possibly of different lengths) are often obtained instead of a single list. Here we introduce a method, based on the algebraic theory of symmetric groups, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated first on synthetic data in a gene filtering task and then for finding gene profiles on a recent prostate cancer dataset.

    An experimental study of the intrinsic stability of random forest variable importance measures

    BACKGROUND: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention paid to traditional stability under data perturbations or parameter variations, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e. bagging, randomization and permutation. To address these influences, in this paper we introduce a new concept of intrinsic stability of VIMs, which is defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations and parameter variations. Two widely used VIMs, i.e., Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. RESULTS: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (size of sample) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability. This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability. CONCLUSION: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs not only comes from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
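
The idea of self-consistency among feature rankings across repeated runs can be illustrated with a small sketch. The abstract does not state which agreement score the authors use, so the mean pairwise Spearman correlation below is an assumption, and the names `ranks`, `spearman`, `intrinsic_stability` and the importance values in `runs` are hypothetical. The importance vectors stand in for the output of repeated random forest runs on the same data with the same parameters.

```python
from itertools import combinations


def ranks(scores):
    """Rank position of each feature (0 = most important); assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    rank = [0] * len(scores)
    for position, feature in enumerate(order):
        rank[feature] = position
    return rank


def spearman(a, b):
    """Spearman rank correlation of two importance vectors without ties."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def intrinsic_stability(runs):
    """Mean pairwise Spearman correlation over repeated runs (assumed score)."""
    pairs = list(combinations(runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)


# Hypothetical importance scores from three repeated runs on identical data.
runs = [
    [0.50, 0.30, 0.20],
    [0.45, 0.35, 0.20],
    [0.50, 0.20, 0.30],
]
stability = intrinsic_stability(runs)
```

A score of 1 means every run produced the same ranking; lower values indicate that the randomness of the method alone is reshuffling features.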

    Network deconvolution as a general method to distinguish direct dependencies in networks

    Recognizing direct relationships between variables connected in a network is a pervasive problem in biological, social and information sciences as correlation-based networks contain numerous indirect relationships. Here we present a general method for inferring direct effects from an observed correlation matrix containing both direct and indirect effects. We formulate the problem as the inverse of network convolution, and introduce an algorithm that removes the combined effect of all indirect paths of arbitrary length in a closed-form solution by exploiting eigen-decomposition and infinite-series sums. We demonstrate the effectiveness of our approach in several network applications: distinguishing direct targets in gene expression regulatory networks; recognizing directly interacting amino-acid residues for protein structure prediction from sequence alignments; and distinguishing strong collaborations in co-authorship social networks using connectivity information alone. In addition to its theoretical impact as a foundational graph theoretic tool, our results suggest network deconvolution is widely applicable for computing direct dependencies in network science across diverse disciplines. Funding: National Institutes of Health (U.S.) (grant R01 HG004037); National Institutes of Health (U.S.) (grant HG005639); Swiss National Science Foundation (Fellowship); National Science Foundation (U.S.) (NSF CAREER Award 0644282)
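
The closed-form step described in this abstract can be sketched in a few lines. The sketch assumes the observed matrix is the sum of all powers of a symmetric direct matrix, G_obs = G_dir + G_dir^2 + ..., which converges to G_dir (I - G_dir)^-1 when the eigenvalues of G_dir are smaller than 1 in magnitude. Under that assumption each eigenvalue maps as lambda_dir = lambda_obs / (1 + lambda_obs). The function name `network_deconvolution` and the toy matrix are hypothetical, and any preprocessing or scaling the authors apply to real data is omitted.

```python
import numpy as np


def network_deconvolution(g_obs):
    """Recover a direct-effect matrix from a symmetric observed matrix.

    Assumes g_obs = g_dir + g_dir^2 + ... with all |eigenvalues of g_dir| < 1.
    """
    g_obs = np.asarray(g_obs, dtype=float)
    # A symmetric matrix has a real eigen-decomposition g_obs = U diag(lam) U^T.
    eigvals, eigvecs = np.linalg.eigh(g_obs)
    # Invert the geometric series eigenvalue by eigenvalue:
    # lam_obs = lam_dir / (1 - lam_dir)  =>  lam_dir = lam_obs / (1 + lam_obs).
    dir_vals = eigvals / (1.0 + eigvals)
    return eigvecs @ np.diag(dir_vals) @ eigvecs.T


# Toy chain 0 - 1 - 2: nodes 0 and 2 have no direct link.
g_dir = np.array([
    [0.0, 0.4, 0.0],
    [0.4, 0.0, 0.3],
    [0.0, 0.3, 0.0],
])
# Simulate the observed matrix, which picks up the indirect 0 - 2 path.
g_obs = g_dir @ np.linalg.inv(np.eye(3) - g_dir)
recovered = network_deconvolution(g_obs)
```

In the toy example the observed matrix has a non-zero entry between nodes 0 and 2 even though they are linked only through node 1; deconvolution returns that entry to zero.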

    A machine learning approach to predict perceptual decisions: an insight into face pareidolia

    The perception of an external stimulus not only depends upon the characteristics of the stimulus but is also influenced by the ongoing brain activity prior to its presentation. In this work, we directly tested whether spontaneous electrical brain activity in the prestimulus period could predict perceptual outcome in face pareidolia (seeing faces in noise images) on a trial-by-trial basis. Participants were presented with only noise images but with the prior information that some faces would be hidden in these images, while their electrical brain activity was recorded; participants reported their perceptual decision, face or no-face, on each trial. Using differential hemispheric asymmetry features based on large-scale neural oscillations in a machine learning classifier, we demonstrated that prestimulus brain activity could achieve a classification accuracy, discriminating face from no-face perception, of 75% across trials. The time–frequency features representing hemispheric asymmetry yielded the best classification performance, and prestimulus alpha oscillations were found to be mostly involved in predicting perceptual decision. These findings suggest a mechanism of how prior expectations in the prestimulus period may affect post-stimulus decision making.

    Meso- and macrozooplankton communities in the Weddell Sea, Antarctica

    The present paper describes the composition and abundance of meso- and macrozooplankton in the epipelagic zone of the Weddell Sea and gives a systematic review of encountered species with regard to results of earlier expeditions. Material was sampled from 6 February to 10 March 1983 from RV Polarstern with an RMT 1+8 m (320 and 4500 μm mesh size). In agreement with topography and water mass distribution, three distinct communities were defined, clearly separated by cluster analysis. The Southern Shelf Community has the lowest abundances (approx. 9000 ind./1000 m³). Euphausia crystallorophias and Metridia gerlachei predominate. Compared with the low overall abundance, the number of regularly occurring species is high (55) due to many neritic forms. Herbivores and omnivores dominate (58% and 35%). The North-eastern Shelf Community has the highest abundances (about 31 000 ind./1000 m³). It is dominated by copepodites I–III of Calanus propinquus and Calanoides acutus (61%). The faunal composition is characterized by both oceanic and neritic species (64). Fine-filter feeders prevail (65%). The Oceanic Community has a mean abundance of approximately 23 000 ind./1000 m³, consisting of 61 species. Dominances are not as pronounced as in the shelf communities. Apart from abundant species like Calanus propinquus, Calanoides acutus, Metridia gerlachei, Oithona spp. and Oncaea spp., many typical inhabitants of the Eastwind Drift are encountered. All feeding types have about the same importance in the Oceanic Community.